action description
Learning Human-like Representations to Enable Learning Human Values
Wynn, Andrea H.
How can we build AI systems that can learn any set of individual human values both quickly and safely, without causing harm or violating societal standards for acceptable behavior during the learning process? We explore the effects of representational alignment between humans and AI agents on learning human values.
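The abstract does not say how representational alignment is measured; below is a minimal sketch of one common proxy, correlating the two parties' pairwise similarity structures over a shared item set. The embeddings are hypothetical stand-ins, not the paper's data.

```python
# Minimal sketch: representational alignment as the Spearman correlation
# between human and agent pairwise-similarity matrices (toy data only).
import numpy as np

def pairwise_cosine(x):
    """Cosine similarity between every pair of item embeddings."""
    normed = x / np.linalg.norm(x, axis=1, keepdims=True)
    return normed @ normed.T

def spearman(a, b):
    """Spearman correlation via rank transform (ties ignored for simplicity)."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

def representational_alignment(human_emb, agent_emb):
    h, g = pairwise_cosine(human_emb), pairwise_cosine(agent_emb)
    iu = np.triu_indices_from(h, k=1)   # upper triangle, diagonal excluded
    return spearman(h[iu], g[iu])

rng = np.random.default_rng(0)
human = rng.normal(size=(20, 8))                # 20 hypothetical items, 8-dim features
agent = human + 0.1 * rng.normal(size=(20, 8))  # a nearly aligned agent
print(round(representational_alignment(human, agent), 3))
```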
- North America > United States > New Jersey > Mercer County > Princeton (0.04)
- North America > United States > Maryland > Baltimore (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Education (1.00)
- Transportation > Passenger (0.46)
- Transportation > Ground > Road (0.46)
Empirical Decision Theory
Jansen, Christoph, Schollmeyer, Georg, Augustin, Thomas, Rodemann, Julian
Analyzing decision problems under uncertainty commonly relies on idealizing assumptions about the describability of the world, with the most prominent examples being the closed world and the small world assumption. Most assumptions are operationalized by introducing states of the world, conditional on which the decision situation can be analyzed without any remaining uncertainty. Conversely, most classical decision-theoretic approaches are not applicable if the states of the world are inaccessible. We propose a decision model that retains the appeal and simplicity of the original theory, but completely overcomes the need to specify the states of the world explicitly. The main idea of our approach is to address decision problems in a radically empirical way: instead of specifying states and consequences prior to the decision analysis, we only assume a protocol of observed act–consequence pairs as model primitives. We show how optimality in such empirical decision problems can be addressed by using protocol-based empirical choice functions and discuss three approaches for deriving inferential guarantees: (I) consistent statistical estimation of choice sets, (II) consistent statistical testing of choice functions with robustness guarantees, and (III) direct inference for empirical choice functions using credal sets. We illustrate our theory with a proof-of-concept application comparing different prompting strategies in generative AI models.
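As a rough, hypothetical illustration of the protocol idea (not the paper's formal machinery), the sketch below stores a protocol of observed act–consequence pairs and implements a naive empirical choice function that keeps the acts with maximal observed mean utility; the acts, consequences, and utility values are invented.

```python
# Minimal sketch of a protocol-based empirical choice function, assuming a
# numeric utility over consequences; all acts and outcomes are hypothetical.
from collections import defaultdict
from statistics import mean

# Protocol: a list of observed (act, consequence) pairs.
protocol = [
    ("prompt_A", "correct"), ("prompt_A", "wrong"),
    ("prompt_B", "correct"), ("prompt_B", "correct"),
    ("prompt_C", "wrong"),
]
utility = {"correct": 1.0, "wrong": 0.0}

def empirical_choice_set(protocol, utility):
    """Return the acts whose empirical mean utility is maximal."""
    outcomes = defaultdict(list)
    for act, consequence in protocol:
        outcomes[act].append(utility[consequence])
    means = {act: mean(us) for act, us in outcomes.items()}
    best = max(means.values())
    return {act for act, m in means.items() if m == best}

print(empirical_choice_set(protocol, utility))  # {'prompt_B'}
```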
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Europe > Austria > Vienna (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- (8 more...)
Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM
Hori, Chiori, Masuyama, Yoshiki, Jain, Siddarth, Corcodel, Radu, Jha, Devesh, Romeres, Diego, Roux, Jonathan Le
Human-robot collaboration towards a shared goal requires robots to understand human actions and interaction with the surrounding environment. This paper focuses on human-robot interaction (HRI) based on human-robot dialogue that relies on robot action confirmation and action step generation using multimodal scene understanding. The state-of-the-art approach uses multimodal transformers to generate robot action steps aligned with robot action confirmation from a single clip showing a task composed of multiple micro steps. Although actions towards a long-horizon task depend on each other throughout an entire video, current approaches mainly focus on clip-level processing and do not leverage long-context information. This paper proposes a long-context Q-Former that incorporates left and right context dependency over full videos. Furthermore, this paper proposes a text-conditioning approach that feeds text embeddings directly into the LLM decoder to mitigate the heavy abstraction of information performed by the Q-Former. Experiments with the YouCook2 corpus show that the accuracy of confirmation generation is a major factor in the performance of action planning. Furthermore, we demonstrate that the long-context Q-Former improves confirmation generation and action planning when integrated with VideoLLaMA3.
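The abstract gives no architectural details; the following is a loose sketch, under our own assumptions, of the core long-context idea: learnable queries cross-attend to the current clip's visual tokens concatenated with left and right context tokens. All dimensions and the interface to the LLM decoder or VideoLLaMA3 are assumed, not taken from the paper.

```python
# Loose sketch of a "long-context" Q-Former block: learnable queries attend to
# the current clip's tokens concatenated with left/right context clip tokens.
# All sizes are made up; this is not the paper's implementation.
import torch
import torch.nn as nn

class LongContextQFormer(nn.Module):
    def __init__(self, dim=256, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, left_ctx, clip, right_ctx):
        # Concatenate left context, current clip, and right context tokens.
        kv = torch.cat([left_ctx, clip, right_ctx], dim=1)      # (B, T_total, dim)
        q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)
        attended, _ = self.cross_attn(q, kv, kv)
        return attended + self.ffn(attended)                    # (B, num_queries, dim)

# Toy usage: 2 videos, 16 visual tokens per clip segment.
B, T, D = 2, 16, 256
model = LongContextQFormer(dim=D)
out = model(torch.randn(B, T, D), torch.randn(B, T, D), torch.randn(B, T, D))
print(out.shape)  # torch.Size([2, 32, 256])
```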
GraSP-VLA: Graph-based Symbolic Action Representation for Long-Horizon Planning with VLA Policies
Neau, Maëlic, Falomir, Zoe, Santos, Paulo E., Bosser, Anne-Gwenn, Buche, Cédric
Deploying autonomous robots that can learn new skills from demonstrations is an important challenge of modern robotics. Existing solutions often apply end-to-end imitation learning with Vision-Language-Action (VLA) models or symbolic approaches with Action Model Learning (AML). Current VLA models are limited by the lack of high-level symbolic planning, which hinders their ability to handle long-horizon tasks. In this paper we present GraSP-VLA, a new neuro-symbolic framework that uses a Continuous Scene Graph representation to generate a symbolic representation of human demonstrations. This representation is used to generate new planning domains during inference and serves as an orchestrator for low-level VLA policies, scaling up the number of actions that can be reproduced in a row. Our results show that GraSP-VLA is effective at modeling symbolic representations for the task of automatic planning domain generation from observations. In addition, results from real-world experiments show the potential of our Continuous Scene Graph representation to orchestrate low-level VLA policies in long-horizon tasks. Inferring the preconditions and outcomes of actions from observations is a long-standing challenge in robotics; these representations can then be used to compose long-horizon tasks from pre-trained low-level behaviors.
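As a hedged, toy illustration of extracting a symbolic action model from scene graphs (not GraSP-VLA's actual pipeline), the sketch below diffs the triples observed before and after a demonstrated action to propose preconditions and add/delete effects; the objects and relations are invented.

```python
# Toy sketch: inferring an action model from before/after scene graphs.
# A scene graph is reduced to a set of (subject, relation, object) triples;
# the relation names and the demonstration itself are made up for illustration.
before = {("cup", "on", "table"), ("gripper", "is", "empty")}
after  = {("cup", "in", "gripper")}

def induce_action_model(before: set, after: set) -> dict:
    """Propose preconditions and effects from a single before/after pair."""
    return {
        "preconditions": before,        # relations that held before acting
        "add_effects": after - before,  # relations the action introduced
        "del_effects": before - after,  # relations the action removed
    }

model = induce_action_model(before, after)
for key, triples in model.items():
    print(key, sorted(triples))
```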
- Oceania > Australia > South Australia > Adelaide (0.04)
- Europe > France (0.04)
- North America > United States > Utah > Salt Lake County > Salt Lake City (0.04)
- (2 more...)
Mano Technical Report
Fu, Tianyu, Su, Anyang, Zhao, Chenxu, Wang, Hanning, Wu, Minghui, Yu, Zhe, Hu, Fei, Shi, Mingjia, Dong, Wei, Wang, Jiayao, Chen, Yuyang, Yu, Ruiyang, Peng, Siran, Li, Menglin, Huang, Nan, Wei, Haitian, Yu, Jiawei, Xin, Yi, Zhao, Xilin, Gu, Kai, Jiang, Ping, Zhou, Sifan, Wang, Shuo
Graphical user interfaces (GUIs) are the primary medium for human-computer interaction, yet automating GUI interactions remains challenging due to the complexity of visual elements, dynamic environments, and the need for multi-step reasoning. Existing methods based on vision-language models (VLMs) often suffer from limited resolution, domain mismatch, and insufficient sequential decision-making capability. To address these issues, we propose Mano, a robust GUI agent built upon a multi-modal foundation model pre-trained on extensive web and computer system data. Our approach integrates a novel simulated environment for high-fidelity data generation, a three-stage training pipeline (supervised fine-tuning, offline reinforcement learning, and online reinforcement learning), and a verification module for error recovery. Mano demonstrates state-of-the-art performance on multiple GUI benchmarks, including Mind2Web and OSWorld, achieving significant improvements in success rate and operational accuracy. Our work provides new insights into the effective integration of reinforcement learning with VLMs for practical GUI agent deployment, highlighting the importance of domain-specific data, iterative training, and holistic reward design.
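The abstract mentions a verification module for error recovery but not its interface; below is a hypothetical control-loop sketch of how such a verifier might gate actions and trigger re-planning. All names (propose_action, execute, verify) are placeholders, not Mano's API.

```python
# Hypothetical GUI-agent loop with a verification / error-recovery step.
# propose_action, execute, and verify are placeholder stubs, not Mano's API.
def propose_action(screen, goal, history):
    # Stand-in for the VLM policy: decide the next UI action.
    return {"kind": "click", "target": "submit button"}

def execute(action):
    # Stand-in for the GUI controller: returns the screen after acting.
    return f"screen after {action['kind']} on {action['target']}"

def verify(screen, action):
    # Stand-in verifier: did the action have the expected effect?
    return action["target"] in screen

def run_episode(goal, max_steps=5, max_retries=2):
    screen, history = "initial screen", []
    for _ in range(max_steps):
        action = propose_action(screen, goal, history)
        for _ in range(max_retries + 1):
            candidate = execute(action)
            if verify(candidate, action):
                screen = candidate
                break
            # On failure, re-plan from the unchanged screen (error recovery).
            action = propose_action(screen, goal, history + [("failed", action)])
        history.append(action)
    return history

print(run_episode("submit the form"))
```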
- Workflow (1.00)
- Instructional Material (0.66)
- Research Report (0.64)
- North America > United States > New Jersey > Mercer County > Princeton (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Maryland > Baltimore (0.04)
- (2 more...)
SV3.3B: A Sports Video Understanding Model for Action Recognition
Kodathala, Sai Varun, Vutukoori, Yashwanth Reddy, Vunnam, Rakesh
This paper addresses the challenge of automated sports video analysis, which has traditionally been limited by computationally intensive models requiring server-side processing and lacking fine-grained understanding of athletic movements. Current approaches struggle to capture the nuanced biomechanical transitions essential for meaningful sports analysis, often missing critical phases like preparation, execution, and follow-through that occur within seconds. To address these limitations, we introduce SV3.3B, a lightweight 3.3B parameter video understanding model that combines novel temporal motion difference sampling with self-supervised learning for efficient on-device deployment. Our approach employs a DWT-VGG16-LDA based keyframe extraction mechanism that intelligently identifies the 16 most representative frames from sports sequences, followed by a V-DWT-JEPA2 encoder pretrained through mask-denoising objectives and an LLM decoder fine-tuned for sports action description generation. Evaluated on a subset of the NSVA basketball dataset, SV3.3B achieves superior performance across both traditional text generation metrics and sports-specific evaluation criteria, outperforming larger closed-source models including GPT-4o variants while maintaining significantly lower computational requirements. Our model demonstrates exceptional capability in generating technically detailed and analytically rich sports descriptions, achieving 29.2% improvement over GPT-4o in ground truth validation metrics, with substantial improvements in information density, action complexity, and measurement precision metrics essential for comprehensive athletic analysis. Model Available at https://huggingface.co/sportsvision/SV3.3B.
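The DWT-VGG16-LDA extractor itself is not described in the abstract; the sketch below is a much simpler, hypothetical stand-in for the general idea of temporal motion-difference sampling: score each frame by its difference from the previous frame and keep the 16 highest-scoring frames.

```python
# Simplified, hypothetical keyframe selection by temporal motion difference:
# score frames by mean absolute difference from the previous frame and keep
# the top-k. This is not the paper's DWT-VGG16-LDA pipeline.
import numpy as np

def motion_difference_keyframes(video: np.ndarray, k: int = 16) -> np.ndarray:
    """video: (num_frames, H, W) grayscale array; returns k sorted frame indices."""
    diffs = np.abs(np.diff(video.astype(np.float32), axis=0)).mean(axis=(1, 2))
    scores = np.concatenate([[0.0], diffs])  # first frame gets a zero motion score
    top_k = np.argsort(scores)[-k:]          # frames with the largest motion change
    return np.sort(top_k)

rng = np.random.default_rng(0)
toy_video = rng.random((120, 64, 64))        # 120 random frames as a stand-in
print(motion_difference_keyframes(toy_video, k=16))
```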
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
Frequency-Semantic Enhanced Variational Autoencoder for Zero-Shot Skeleton-based Action Recognition
Wu, Wenhan, Guo, Zhishuai, Chen, Chen, Xue, Hongfei, Lu, Aidong
Zero-shot skeleton-based action recognition aims to develop models capable of identifying actions beyond the categories encountered during training. Previous approaches have primarily focused on aligning visual and semantic representations but often overlooked the importance of fine-grained action patterns in the semantic space (e.g., the hand movements in drinking water and brushing teeth). To address these limitations, we propose a Frequency-Semantic Enhanced Variational Autoencoder (FS-VAE) to explore the skeleton semantic representation learning with frequency decomposition. FS-VAE consists of three key components: 1) a frequency-based enhancement module with high- and low-frequency adjustments to enrich the skeletal semantics learning and improve the robustness of zero-shot action recognition; 2) a semantic-based action description with multilevel alignment to capture both local details and global correspondence, effectively bridging the semantic gap and compensating for the inherent loss of information in skeleton sequences; 3) a calibrated cross-alignment loss that enables valid skeleton-text pairs to counterbalance ambiguous ones, mitigating discrepancies and ambiguities in skeleton and text features, thereby ensuring robust alignment. Evaluations on the benchmarks demonstrate the effectiveness of our approach, validating that frequency-enhanced semantic features enable robust differentiation of visually and semantically similar action clusters, improving zero-shot action recognition.
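As a minimal sketch of the frequency-decomposition idea only (not FS-VAE's actual module), the code below splits each joint trajectory into low- and high-frequency bands with an FFT mask and reweights them; the cutoff and gains are arbitrary assumptions.

```python
# Toy sketch of frequency-based enhancement of a skeleton sequence: split each
# joint trajectory into low/high-frequency bands via FFT and reweight them.
# Cutoff and gains are arbitrary; this is not FS-VAE's actual module.
import numpy as np

def frequency_enhance(skeleton: np.ndarray, cutoff: int = 5,
                      low_gain: float = 1.0, high_gain: float = 1.5) -> np.ndarray:
    """skeleton: (T, J, C) sequence of T frames, J joints, C coordinates."""
    spec = np.fft.rfft(skeleton, axis=0)            # frequency spectrum along time
    low = np.zeros_like(spec)
    low[:cutoff] = spec[:cutoff]                    # keep only the lowest frequencies
    high = spec - low                               # the remaining high frequencies
    enhanced = low_gain * low + high_gain * high    # adjust the two bands separately
    return np.fft.irfft(enhanced, n=skeleton.shape[0], axis=0)

T, J, C = 64, 25, 3                                 # e.g. 25 joints with xyz coordinates
seq = np.cumsum(np.random.default_rng(0).normal(size=(T, J, C)), axis=0)
print(frequency_enhance(seq).shape)                 # (64, 25, 3)
```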
- North America > United States > North Carolina (0.04)
- North America > United States > Illinois (0.04)
- Europe > Denmark > Capital Region > Copenhagen (0.04)